Bioinformatics (Thomas Dandekar, Meik Kunz)


Transcription Code

The first code, the transcription code, determines when and how strongly a gene is read,
in particular on the basis of promoter sequences. Transcription factor binding sites are
encoded by short nucleotide sequences, several of which act together to regulate
transcription in nucleated (eukaryotic) cells.
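To make the idea of short, sequence-encoded binding sites concrete, the following minimal sketch scans a promoter for consensus motifs written in IUPAC nucleotide codes. Everything in it is an assumption for illustration: the promoter sequence is made up, the helper names `consensus_to_regex` and `find_sites` are hypothetical, and the consensus strings are common textbook forms of the TATA box and CCAAT box. Databases such as JASPAR store matrix profiles rather than plain consensus strings, so this is a simplification and not the method of any of the tools named here.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_sites(sequence, consensus):
    """Return 0-based start positions of all matches on the forward strand.

    A lookahead is used so that overlapping matches are reported as well.
    """
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Made-up promoter fragment (illustration only, not a real gene)
promoter = "GGCCGCTATAAAAGGCGCGATTGGCCAATCAG"

# Textbook-style consensus motifs (assumed here, not taken from a database)
motifs = {"TATA box": "TATAWAW", "CCAAT box": "CCAAT"}

hits = {name: find_sites(promoter, cons) for name, cons in motifs.items()}
for name, positions in hits.items():
    print(name, positions)
```

The sketch only searches the forward strand; a real analysis would also scan the reverse complement and score matches instead of accepting any exact consensus hit.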

Some programs that can examine a promoter in more detail, such as TESS and Genomatix,
have already been introduced in Chap. 11. There are also further databases, such as
TRANSFAC (https://www.gene-regulation.com/pub/databases.html), MotifMap
(https://motifmap.igb.uci.edu/) and JASPAR (https://jaspar.genereg.net/). Some of these
are publicly available for looking up and searching transcription factor binding sites;
others have since become commercial and are no longer free to use.

The closer one looks, however, the less clear the transcription code becomes. In
particular, it is unclear which still unidentified transcription factors must also be
taken into account, and how more distant sequences that increase (“enhancers”) or
decrease (“silencers”) transcription contribute.

RNA Codes

The next step, the processing and splicing of the precursor RNA, follows its own codes
as well. The splice-site sequences that distinguish intron from exon are already
relatively well characterized, but it turns out that each organism has its own dialect
for deciding what is spliced and how. A good program that predicts such sequences in an
adaptive, species-specific way is Augustus (https://bioinf.uni-greifswald.de/augustus/).
It can be trained specifically on new species and uses hidden Markov models for
prediction (Stanke et al. 2008).
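As a rough illustration of what a splicing "code" looks like at its simplest, the sketch below lists candidate introns using only the canonical rule that most introns begin with GT and end with AG (GU and AG in the RNA). This is deliberately naive and is not how Augustus works: Augustus uses trained, species-specific hidden Markov models, whereas this toy ignores all sequence context. The function name `candidate_introns`, the length limits and the toy sequence are assumptions made for the example.

```python
def candidate_introns(seq, min_len=20, max_len=200):
    """List (start, end) pairs of GT...AG stretches, end exclusive.

    Only the canonical dinucleotides at the intron boundaries are checked;
    no branch point, polypyrimidine tract or species-specific model is used.
    """
    seq = seq.upper()
    # Positions where a potential donor site (GT) begins
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    # Positions just past a potential acceptor site (AG)
    acceptor_ends = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [
        (d, a)
        for d in donors
        for a in acceptor_ends
        if min_len <= a - d <= max_len
    ]


# Toy pre-mRNA in DNA letters (made up for illustration)
exon1 = "ATGCCC"
intron = "GTACCTTCCTTCCTTCCTTCCAG"
exon2 = "CCCTAA"
pre_mrna = exon1 + intron + exon2

introns = candidate_introns(pre_mrna)
start, end = introns[0]
spliced = pre_mrna[:start] + pre_mrna[end:]
print(introns, spliced)
```

On real genomic sequence this rule produces a very large number of false candidates, which is exactly why trained probabilistic models are used instead.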

Numerous other codes can be found in the RNA, in particular signals that decide whether
the mRNA leaves the nucleus (for mRNA this is generally just one modified nucleotide,
the 7-methylguanosine cap), as well as many sequences that regulate the translation,
localisation and stability of the RNA (see the first part; a standard program for
reading these codes is the RNAAnalyzer:
https://rnaanalyzer.bioapps.biozentrum.uni-wuerzburg.de).

Protein Codes

Once the protein has been translated according to the genetic code, the question arises
whether it is modified post-translationally, i.e. whether sugar residues (e.g. on
asparagine residues), lipids or acetyl groups (e.g. on lysine residues) are attached to
individual amino acids.

Next, based on its sequence, the protein folds into a molten globule, usually within
milliseconds of its synthesis, through the formation of secondary structure, and then,
within seconds, arranges itself into its final three-dimensional structure. This complex
3-D code has not yet been “cracked” either: the biophysical codes are not known in
detail, nor do we have computers powerful enough to predict the structure accurately. In

13.1 The Different Languages and Codes in a Cell